Lab book: PCA outputs, spike proteins

Liam Brierley (University of Liverpool)
2020-04-27

PCAs based on genome composition for 5972 coronavirus spike protein sequences. Points are colour-coded and ellipses drawn according to different outcome variables, though the underlying PCA for each bias type is the same. Mouseover gives outcome variable and virus name.

Dinucleotide bias // genus

Fairly clear clustering and separation of genera!
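The notes do not record how dinucleotide bias was calculated. A minimal sketch, assuming the common observed/expected definition (frequency of each dinucleotide divided by the product of its two nucleotide frequencies), would give 16 features per sequence as PCA input. The function name `dinucleotide_bias` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def dinucleotide_bias(seq):
    """Observed/expected ratio for each of the 16 dinucleotides.

    Assumption: bias = f(XY) / (f(X) * f(Y)), counted over overlapping
    dinucleotides. Ambiguous bases (e.g. N) are skipped.
    """
    seq = seq.upper().replace("U", "T")
    mono = Counter(b for b in seq if b in BASES)
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    di = Counter(p for p in pairs if p[0] in BASES and p[1] in BASES)
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence has no unambiguous dinucleotides")

    bias = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        # A base absent from the sequence gives an undefined ratio
        bias[x + y] = observed / expected if expected > 0 else float("nan")
    return bias

# Toy sequence for illustration only, not one of the spike sequences
toy_bias = dinucleotide_bias("ACGTACGT")
```

Values above 1 mean a dinucleotide is over-represented relative to base composition, and values below 1 mean it is under-represented.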

Dinucleotide bias // human infection capability

Very tight clusters, especially of SARS-CoV and SARS-CoV-2. MERS-CoV, SARS-CoV and SARS-CoV-2 are actually as distinct from each other as they are from other human CoVs. And many close animal viruses…

Codon bias (RSCU) // genus

Separation seems driven almost entirely by stop codon use, alphas preferring TGA, betas and gammas preferring TAA, deltas somewhere in between.
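For reference, a minimal sketch of RSCU under its usual definition: a codon's count divided by the mean count across all synonymous codons for the same amino acid. This assumes the standard genetic code and an in-frame coding sequence. The function name `rscu` and the `include_stop` flag are hypothetical, and the code actually used for these PCAs is not recorded here.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds, include_stop=True):
    """Relative synonymous codon usage for one in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # drop any trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families by amino acid ("*" = stop)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, synonyms in families.items():
        if aa == "*" and not include_stop:
            continue
        total = sum(counts[c] for c in synonyms)
        for c in synonyms:
            # count / (family total / family size); undefined if family unused
            values[c] = counts[c] * len(synonyms) / total if total else float("nan")
    return values

# Toy sequence for illustration only
toy = rscu("ATGGCTGCTGCCTAA")
```

Under this definition, a sequence with a single terminal stop codon can only score 3 for the stop it uses and 0 for the other two, so the three stop-codon features are far more extreme than most others. That may be part of why they dominate the separation when stop codons are included.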

Codon bias (RSCU) // human infection capability

Epidemic coronaviruses are strongly separated from each other, but not too separated from other human viruses again. Virtually all the human viruses prefer TAA, except HCoV-HKU1 doesn’t seem fussy.

Codon bias (RSCU) without stop codon // genus

Codons contribute much more evenly now, strongest being AGA/CGT (Arginine). Gammas/deltas well separated but not others.

Codon bias (RSCU) without stop codon // human infection capability

Human viruses still don’t cluster together; again this likely reflects receptor usage, as all SARS-like viruses sit top left.

Codon bias (RSCU) without stop codon // human infection capability // PC3 and PC4

However, looking at PC3 and PC4 (which explain much less overall variation) gives us a much better separation between human and non-human, regardless of receptor…?! No particular amino acid loads strongly here. Good signal to capture..!
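To make the "explain much less overall variation" point concrete, here is a dependency-free sketch of pulling out later components together with their share of total variance. It uses power iteration with deflation on the covariance matrix as a stand-in. A real analysis would normally use a library PCA routine, and whether the features were scaled before PCA is not recorded here, so this sketch only centres them. The function name `pca_power_iteration` is hypothetical.

```python
import math

def pca_power_iteration(rows, n_components=4, n_iter=500):
    """Return (scores, components, explained_variance_ratio).

    rows: list of equal-length feature vectors (one per sequence).
    Features are centred but not scaled (an assumption).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(x[i] * x[j] for x in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    total_var = sum(cov[i][i] for i in range(p))

    components, ratios = [], []
    for _component in range(n_components):
        # Deterministic, normalised starting vector
        v = [1.0 / (j + 1) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance along this component
        cv = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        eigval = sum(v[i] * cv[i] for i in range(p))
        components.append(v)
        ratios.append(eigval / total_var)
        # Deflate: remove this component so the next iteration finds the next one
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(p)]
               for i in range(p)]

    scores = [[sum(x[j] * c[j] for j in range(p)) for c in components]
              for x in centred]
    return scores, components, ratios
```

With `scores, components, ratios = pca_power_iteration(features)`, the PC3/PC4 coordinates of each sequence are `row[2]` and `row[3]` of `scores`, `ratios[2]` and `ratios[3]` give the fraction of variance they explain, and `components[2]` and `components[3]` hold the per-feature loadings.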

Amino acid bias // genus

Clusters, but not as clear a separation here.
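The definition of amino acid bias is not given in these notes. A minimal sketch, assuming it means the proportion of each of the 20 standard amino acids in the translated spike sequence, would be as follows. The function name `amino_acid_bias` is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_bias(protein):
    """Proportion of each standard amino acid in a protein sequence.

    Assumption: bias = simple relative frequency. A trailing stop symbol
    and any ambiguous residues (e.g. X) are ignored.
    """
    protein = protein.upper().rstrip("*")
    counts = Counter(a for a in protein if a in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence has no standard amino acids")
    return {a: counts[a] / total for a in AMINO_ACIDS}

# Toy peptide for illustration only
toy_aa = amino_acid_bias("MAAK*")
```

Because the 20 proportions sum to 1, the features are not independent of each other, which is worth keeping in mind when reading the loadings.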

Amino acid bias // human infection capability

Surprisingly, the epidemic human coronaviruses are well separated from the endemic human coronaviruses here!